This R Markdown document explores house sale prices in King County, USA, between May 2014 and May 2015.
After loading the required packages (installing them if needed) and reading the dataset, I will go through an exploratory data analysis (EDA) with visualizations to identify the most important features and the correlations between them.
Next, feature engineering will be conducted: the process of using domain knowledge to create new features from the raw data, which, if done correctly, increases the predictive power of the machine learning algorithms.
Finally, I will apply different machine learning algorithms and evaluate each one on a train/validation split.
The train-test splitting function that will be used later on:
# Split a data frame into a train set (~2/3 of rows) and a test set (~1/3)
splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- 1:nrow(dataframe)
  trainindex <- sample(index, trunc(length(index) / 1.5))
  trainset <- dataframe[trainindex, ]
  testset <- dataframe[-trainindex, ]
  list(trainset = trainset, testset = testset)
}
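As a quick sanity check (a toy example, assuming the `splitdf()` above is in scope): `trunc(n / 1.5)` sends roughly two thirds of the rows to the train set.

```r
# Toy check of the split ratio: with 9 rows, trunc(9 / 1.5) = 6 rows
# go to the train set and the remaining 3 to the test set.
toy <- data.frame(x = 1:9, y = letters[1:9])
parts <- splitdf(toy, seed = 1)
nrow(parts$trainset)  # 6
nrow(parts$testset)   # 3
```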
train <- read.csv(url("https://gist.githubusercontent.com/CelineKhoury/53ddc466a65e9ae1cbaa2773638b87b9/raw/3baf60131c21bb35782e043a5bc77a6f386184a4/house_price_train.csv"))
test <- read.csv(url("https://gist.githubusercontent.com/CelineKhoury/2ea2e63bf418d91270812f2a1f30ef2b/raw/8f6a575a60da419e0a8dc029ad17d1db235939a8/house_price_test.csv"))
summary(train)
## id date price bedrooms
## Min. :1.000e+06 23/06/2014: 118 Min. : 78000 Min. : 1.000
## 1st Qu.:2.114e+09 25/06/2014: 113 1st Qu.: 320000 1st Qu.: 3.000
## Median :3.902e+09 26/06/2014: 110 Median : 450000 Median : 3.000
## Mean :4.566e+09 22/04/2015: 108 Mean : 539865 Mean : 3.369
## 3rd Qu.:7.303e+09 08/07/2014: 104 3rd Qu.: 645500 3rd Qu.: 4.000
## Max. :9.900e+09 14/04/2015: 102 Max. :7700000 Max. :33.000
## (Other) :16622
## bathrooms sqft_living sqft_lot floors
## Min. :0.500 Min. : 370 Min. : 520 Min. :1.000
## 1st Qu.:1.750 1st Qu.: 1430 1st Qu.: 5050 1st Qu.:1.000
## Median :2.250 Median : 1910 Median : 7620 Median :1.500
## Mean :2.114 Mean : 2080 Mean : 15186 Mean :1.493
## 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10695 3rd Qu.:2.000
## Max. :8.000 Max. :13540 Max. :1164794 Max. :3.500
##
## waterfront view condition grade
## Min. :0.000000 Min. :0.0000 Min. :1.000 Min. : 3.00
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.: 7.00
## Median :0.000000 Median :0.0000 Median :3.000 Median : 7.00
## Mean :0.007467 Mean :0.2335 Mean :3.413 Mean : 7.66
## 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.: 8.00
## Max. :1.000000 Max. :4.0000 Max. :5.000 Max. :13.00
##
## sqft_above sqft_basement yr_built yr_renovated
## Min. : 370 Min. : 0.0 Min. :1900 Min. : 0.00
## 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.00
## Median :1564 Median : 0.0 Median :1975 Median : 0.00
## Mean :1791 Mean : 289.4 Mean :1971 Mean : 85.35
## 3rd Qu.:2210 3rd Qu.: 556.0 3rd Qu.:1997 3rd Qu.: 0.00
## Max. :9410 Max. :4820.0 Max. :2015 Max. :2015.00
##
## zipcode lat long sqft_living15
## Min. :98001 Min. :47.16 Min. :-122.5 Min. : 460
## 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490
## Median :98065 Median :47.57 Median :-122.2 Median :1840
## Mean :98078 Mean :47.56 Mean :-122.2 Mean :1986
## 3rd Qu.:98117 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360
## Max. :98199 Max. :47.78 Max. :-121.3 Max. :6210
##
## sqft_lot15
## Min. : 659
## 1st Qu.: 5100
## Median : 7639
## Mean : 12826
## 3rd Qu.: 10080
## Max. :871200
##
summary(test)
## id date bedrooms bathrooms
## Min. :5.200e+06 09/07/2014: 35 Min. :1.00 Min. :0.500
## 1st Qu.:2.193e+09 23/04/2015: 30 1st Qu.:3.00 1st Qu.:1.500
## Median :3.972e+09 23/07/2014: 30 Median :3.00 Median :2.250
## Mean :4.637e+09 28/04/2015: 30 Mean :3.39 Mean :2.123
## 3rd Qu.:7.367e+09 20/05/2014: 29 3rd Qu.:4.00 3rd Qu.:2.500
## Max. :9.842e+09 16/07/2014: 28 Max. :9.00 Max. :6.750
## (Other) :4138
## sqft_living sqft_lot floors waterfront
## Min. : 420 Min. : 649 Min. :1.0 Min. :0.00000
## 1st Qu.:1430 1st Qu.: 5026 1st Qu.:1.0 1st Qu.:0.00000
## Median :1920 Median : 7586 Median :1.5 Median :0.00000
## Mean :2081 Mean : 14755 Mean :1.5 Mean :0.00787
## 3rd Qu.:2530 3rd Qu.: 10654 3rd Qu.:2.0 3rd Qu.:0.00000
## Max. :9200 Max. :1651359 Max. :3.5 Max. :1.00000
##
## view condition grade sqft_above
## Min. :0.0000 Min. :1.000 Min. : 4.000 Min. : 420
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.: 7.000 1st Qu.:1200
## Median :0.0000 Median :3.000 Median : 7.000 Median :1550
## Mean :0.2375 Mean :3.399 Mean : 7.651 Mean :1780
## 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.: 8.000 3rd Qu.:2210
## Max. :4.0000 Max. :5.000 Max. :13.000 Max. :6720
##
## sqft_basement yr_built yr_renovated zipcode
## Min. : 0.0 Min. :1900 Min. : 0.00 Min. :98001
## 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.00 1st Qu.:98033
## Median : 0.0 Median :1975 Median : 0.00 Median :98070
## Mean : 301.1 Mean :1971 Mean : 80.92 Mean :98079
## 3rd Qu.: 600.0 3rd Qu.:1997 3rd Qu.: 0.00 3rd Qu.:98118
## Max. :3500.0 Max. :2015 Max. :2015.00 Max. :98199
##
## lat long sqft_living15 sqft_lot15
## Min. :47.16 Min. :-122.5 Min. : 399 Min. : 651
## 1st Qu.:47.48 1st Qu.:-122.3 1st Qu.:1490 1st Qu.: 5100
## Median :47.57 Median :-122.2 Median :1840 Median : 7556
## Mean :47.56 Mean :-122.2 Mean :1989 Mean : 12489
## 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360 3rd Qu.: 10116
## Max. :47.78 Max. :-121.4 Max. :5790 Max. :858132
##
dim(train)
## [1] 17277 21
dim(test)
## [1] 4320 20
test$price <- NA
train$dataType <- "train"
test$dataType <- "test"
dim(train)
## [1] 17277 22
dim(test)
## [1] 4320 22
identical(names(train), names(test))
## [1] FALSE
identical() returns FALSE only because the columns are in a different order (price sits in the middle of train but was appended at the end of test); rbind() for data frames matches columns by name, so the bind is still valid.
dataset <- rbind(train, test)
summary(dataset)
## id date price bedrooms
## Min. :1.000e+06 23/06/2014: 142 Min. : 78000 Min. : 1.000
## 1st Qu.:2.123e+09 25/06/2014: 131 1st Qu.: 320000 1st Qu.: 3.000
## Median :3.905e+09 26/06/2014: 131 Median : 450000 Median : 3.000
## Mean :4.580e+09 08/07/2014: 127 Mean : 539865 Mean : 3.373
## 3rd Qu.:7.309e+09 27/04/2015: 126 3rd Qu.: 645500 3rd Qu.: 4.000
## Max. :9.900e+09 25/03/2015: 123 Max. :7700000 Max. :33.000
## (Other) :20817 NA's :4320
## bathrooms sqft_living sqft_lot floors
## Min. :0.500 Min. : 370 Min. : 520 Min. :1.000
## 1st Qu.:1.750 1st Qu.: 1430 1st Qu.: 5040 1st Qu.:1.000
## Median :2.250 Median : 1910 Median : 7618 Median :1.500
## Mean :2.116 Mean : 2080 Mean : 15099 Mean :1.494
## 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10685 3rd Qu.:2.000
## Max. :8.000 Max. :13540 Max. :1651359 Max. :3.500
##
## waterfront view condition grade
## Min. :0.000000 Min. :0.0000 Min. :1.00 Min. : 3.000
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.00 1st Qu.: 7.000
## Median :0.000000 Median :0.0000 Median :3.00 Median : 7.000
## Mean :0.007547 Mean :0.2343 Mean :3.41 Mean : 7.658
## 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.00 3rd Qu.: 8.000
## Max. :1.000000 Max. :4.0000 Max. :5.00 Max. :13.000
##
## sqft_above sqft_basement yr_built yr_renovated
## Min. : 370 Min. : 0.0 Min. :1900 Min. : 0.00
## 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.00
## Median :1560 Median : 0.0 Median :1975 Median : 0.00
## Mean :1789 Mean : 291.7 Mean :1971 Mean : 84.46
## 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997 3rd Qu.: 0.00
## Max. :9410 Max. :4820.0 Max. :2015 Max. :2015.00
##
## zipcode lat long sqft_living15
## Min. :98001 Min. :47.16 Min. :-122.5 Min. : 399
## 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490
## Median :98065 Median :47.57 Median :-122.2 Median :1840
## Mean :98078 Mean :47.56 Mean :-122.2 Mean :1987
## 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360
## Max. :98199 Max. :47.78 Max. :-121.3 Max. :6210
##
## sqft_lot15 dataType
## Min. : 651 Length:21597
## 1st Qu.: 5100 Class :character
## Median : 7620 Mode :character
## Mean : 12758
## 3rd Qu.: 10083
## Max. :871200
##
str(dataset)
## 'data.frame': 21597 obs. of 22 variables:
## $ id : num 9.18e+09 4.64e+08 2.22e+09 6.16e+09 6.39e+09 ...
## $ date : Factor w/ 372 levels "01/02/2015","01/03/2015",..: 155 329 219 355 277 241 201 199 319 255 ...
## $ price : num 225000 641250 810000 330000 530000 ...
## $ bedrooms : int 3 3 4 4 4 4 4 3 4 3 ...
## $ bathrooms : num 1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
## $ sqft_living : int 1250 2220 3980 1890 1814 3120 4160 1440 2250 2540 ...
## $ sqft_lot : int 7500 2550 209523 7540 5000 5086 47480 10500 6840 9520 ...
## $ floors : num 1 3 2 1 1 2 2 1 2 1 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 2 2 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 4 4 3 3 3 3 3 ...
## $ grade : int 7 10 9 7 7 9 10 8 9 8 ...
## $ sqft_above : int 1250 2220 3980 1890 944 2480 4160 1130 2250 1500 ...
## $ sqft_basement: int 0 0 0 0 870 640 0 310 0 1040 ...
## $ yr_built : int 1967 1990 2006 1967 1951 2008 1995 1983 1987 1959 ...
## $ yr_renovated : int 0 0 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
## $ lat : num 47.4 47.7 47.6 47.8 47.7 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
## $ sqft_lot15 : int 7563 5610 65775 8515 5000 5092 40428 8125 7386 6800 ...
## $ dataType : chr "train" "train" "train" "train" ...
We have 20 feature columns, and the target variable is price. Let's analyze the features to check whether they need cleaning or preprocessing.
colSums(is.na(dataset))
## id date price bedrooms bathrooms
## 0 0 4320 0 0
## sqft_living sqft_lot floors waterfront view
## 0 0 0 0 0
## condition grade sqft_above sqft_basement yr_built
## 0 0 0 0 0
## yr_renovated zipcode lat long sqft_living15
## 0 0 0 0 0
## sqft_lot15 dataType
## 0 0
We will visualize all numeric columns against the house price.
numeric_var <- names(train)[which(sapply(train, is.numeric))]
for (i in seq_along(numeric_var)) {
plt <- ggplot(data = train, aes_string(numeric_var[i], 'price')) +
geom_point(color = "red") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
print(plt)
}
Removing outliers manually, based on the plots of each variable against the target variable:
train <- subset(train, bedrooms < 20)
train <- subset(train, sqft_living < 11000)
dfCor <- cor(train[sapply(train, is.numeric)])
corrplot(dfCor)
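To complement the corrplot, the same matrix can be used to rank features by their absolute correlation with the target (a sketch reusing the dfCor computed above):

```r
# Rank numeric features by absolute correlation with price,
# reusing the correlation matrix dfCor computed above.
price_cor <- dfCor["price", ]
price_cor <- price_cor[names(price_cor) != "price"]
head(sort(abs(price_cor), decreasing = TRUE))
```

For the King County data, sqft_living and grade usually sit at the top of this ranking.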
## Check for the Distribution of the price
g1 <- ggplot(dataset, aes(x = price)) + geom_density(fill = "blue")
g2 <- ggplot(dataset, aes(x = sqft_living)) + geom_histogram(binwidth = 1, fill = "green")
grid.arrange(g1, g2, nrow = 1, ncol = 2)
## Warning: Removed 4320 rows containing non-finite values (stat_density).
## The distribution of house prices & sqft_living is right-skewed, so let's apply log10() and then plot the distribution again
dataset<-dataset %>%
mutate(log_price=log10(price))
ggplot(dataset,aes(x=log_price))+geom_histogram(fill="green",binwidth=0.10)
## Warning: Removed 4320 rows containing non-finite values (stat_bin).
dataset<-dataset %>%
mutate(log_size=log10(sqft_living))
ggplot(dataset,aes(x=log_size))+geom_histogram(fill="blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
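The effect of the log10() transform can also be checked numerically. As a crude, package-free skewness indicator (an assumption of this sketch, not a measure used elsewhere in the analysis), the gap between mean and median in standard-deviation units shrinks sharply after logging:

```r
# Crude skew indicator: (mean - median) / sd. For a right-skewed
# variable this is clearly positive; after log10() it should be
# much closer to zero.
p <- train$price
c(raw    = (mean(p) - median(p)) / sd(p),
  logged = (mean(log10(p)) - median(log10(p))) / sd(log10(p)))
```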
Check for the distribution of the price against the bedrooms variable.
ggplot(dataset, aes(x = bedrooms, y = log_price, col = bedrooms)) +
  geom_point(alpha = 0.5, size = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Bedrooms vs Price") +
  scale_color_gradientn(colors = 'yellow') +
  theme(legend.position = "none")
## Warning: Removed 4320 rows containing non-finite values (stat_smooth).
## Warning: Removed 4320 rows containing missing values (geom_point).
Check for the distribution of the bathrooms variable.
ggplot(dataset,aes(x=bathrooms))+geom_histogram(fill="green4",binwidth=0.5,size=0.1)+
scale_x_continuous(limits=c(1,8))
## Warning: Removed 75 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
Check for the distribution of the house prices with respect to the condition of the house.
ggplot(dataset,aes(factor(condition),log_price,fill=factor(condition)))+
geom_boxplot(alpha=0.6)+scale_fill_manual(values=rainbow(6))+
theme(legend.position="none")+
labs(x="House Condition")
## Warning: Removed 4320 rows containing non-finite values (stat_boxplot).
Interpretation of the boxplot: houses in better condition clearly tend to sell at higher prices.
Creating a new logical variable ‘houseview’: TRUE if the house has a waterfront view, FALSE otherwise.
dataset$houseview <- ifelse(dataset$waterfront == 1, TRUE, FALSE)
dataset %>%
select(log_price, houseview) %>%
glimpse()
## Observations: 21,597
## Variables: 2
## $ log_price <dbl> 5.352183, 5.807027, 5.908485, 5.518514, 5.724276, 5....
## $ houseview <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
ggplot(dataset, aes(x = houseview, y = log_price,fill=houseview)) +
geom_boxplot(alpha=0.5) +
labs(x = "waterfront", y = "logprice")+
scale_fill_manual(values=rainbow(n=12))
## Warning: Removed 4320 rows containing non-finite values (stat_boxplot).
Interpretation of the boxplot: houses with a view of the waterfront tend to be much more expensive than houses without one.
1- Create a new variable for the outdoor area of the lot, given by the difference between sqft_lot and sqft_living.
2- First extract the year of sale (yr_sold) from the date. Then create a new variable for the age of the house, given by the difference between the year the house was sold and the year it was built.
A difference of zero means the house is new.
3- Create a new variable ‘renovated’ indicating whether the house has been renovated.
4- Since zipcodes represent the area, the average price per zipcode and the number of houses per zipcode will be added to the dataset (keep in mind that these values are computed only from the train set, since in practice this information would not be available for the test set).
The dataset is split once here solely to separate test from train and avoid leaking information; this split is only for completing the feature engineering.
5- To make these variables more precise, the average per zipcode should not include the price of the house we are training on, so the leave-one-out formula ((avg * cnt) - price) / (cnt - 1) is applied.
6- Since we don't have the prices of the final test set, the price average takes an NA value there; it is replaced by the overall mean.
7- Standardize the grade and condition with respect to the neighbors: (grade - avg_grade(per_zip)) / std_grade(per_zip) and (condition - avg_condition(per_zip)) / std_condition(per_zip). Keep in mind that in this case grade and condition are available even for the test set.
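The leave-one-out formula in step 5 can be verified on a toy vector: removing one observation from a precomputed group mean gives the same value as recomputing the mean without it.

```r
# Toy check of the leave-one-out mean formula from step 5:
# ((avg * cnt) - price) / (cnt - 1) == mean of the remaining prices.
prices <- c(100, 200, 300, 400)
avg <- mean(prices)    # 250
cnt <- length(prices)  # 4
loo <- (avg * cnt - prices[1]) / (cnt - 1)  # leave out 100 -> (1000 - 100) / 3
stopifnot(all.equal(loo, mean(prices[-1])))  # both are 300
```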
# 1
dataset$outdoor <- dataset$sqft_lot - dataset$sqft_living
# 2
dataset$yr_sold <- as.character(dataset$date)
dataset$yr_sold <- substring(dataset$yr_sold, 7, 10)  # dates are "dd/mm/yyyy"
# if the difference is equal to zero it means that the house is new
dataset$yr_sold <- as.numeric(dataset$yr_sold)
dataset$yr_built <- as.numeric(dataset$yr_built)
dataset$house_age <- dataset$yr_sold - dataset$yr_built
# 3
dataset <- dataset %>% mutate(renovated = ifelse(yr_renovated == 0, 0, 1))
table(dataset$renovated)
##
## 0 1
## 20683 914
# 4
test_bfe <- dataset[dataset$dataType == 'test', ]
train_bfe <- dataset[dataset$dataType == 'train', ]
m <- splitdf(train_bfe, seed = 20)
train_set <- m$trainset
avg_price <- train_set[, c('log_price', 'zipcode')] %>%
  group_by(zipcode) %>%
  summarise(log_price_avg = mean(log_price), house_cnt = length(log_price))
dataset <- join(dataset, avg_price, by = 'zipcode')
# 5
dataset$log_price_avg_adj <- ((dataset$log_price_avg * dataset$house_cnt) - dataset$log_price) / (dataset$house_cnt - 1)
# 6
dataset$log_price_avg_adj[is.na(dataset$log_price_avg_adj)] <- dataset$log_price_avg[is.na(dataset$log_price_avg_adj)]
dataset <- dataset[,-which(names(dataset) == 'log_price_avg')]
# 7
grade_subs <- dataset[, c('grade', 'condition', 'zipcode')] %>%
  group_by(zipcode) %>%
  summarise(grade_avg = mean(grade), grade_std = sd(grade),
            condition_avg = mean(condition), condition_std = sd(condition))
subset <- dataset[, c('grade', 'condition', 'zipcode')]
subset <- join(subset, grade_subs, by = 'zipcode')
subset$resp_grade <- (subset$grade - subset$grade_avg) / subset$grade_std
subset$resp_cond <- (subset$condition - subset$condition_avg) / subset$condition_std
dataset$resp_grade <- subset$resp_grade
dataset$resp_cond<- subset$resp_cond
str(dataset)
## 'data.frame': 21597 obs. of 33 variables:
## $ id : num 9.18e+09 4.64e+08 2.22e+09 6.16e+09 6.39e+09 ...
## $ date : Factor w/ 372 levels "01/02/2015","01/03/2015",..: 155 329 219 355 277 241 201 199 319 255 ...
## $ price : num 225000 641250 810000 330000 530000 ...
## $ bedrooms : int 3 3 4 4 4 4 4 3 4 3 ...
## $ bathrooms : num 1.5 2.5 3.5 1.5 1.75 3.5 3.25 2.25 2.5 1.5 ...
## $ sqft_living : int 1250 2220 3980 1890 1814 3120 4160 1440 2250 2540 ...
## $ sqft_lot : int 7500 2550 209523 7540 5000 5086 47480 10500 6840 9520 ...
## $ floors : num 1 3 2 1 1 2 2 1 2 1 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 2 2 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 4 4 3 3 3 3 3 ...
## $ grade : int 7 10 9 7 7 9 10 8 9 8 ...
## $ sqft_above : int 1250 2220 3980 1890 944 2480 4160 1130 2250 1500 ...
## $ sqft_basement : int 0 0 0 0 870 640 0 310 0 1040 ...
## $ yr_built : num 1967 1990 2006 1967 1951 ...
## $ yr_renovated : int 0 0 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98030 98117 98024 98155 98115 98115 98072 98023 98058 98115 ...
## $ lat : num 47.4 47.7 47.6 47.8 47.7 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15 : int 1260 2200 2220 1890 1290 1880 3400 1510 2480 1870 ...
## $ sqft_lot15 : int 7563 5610 65775 8515 5000 5092 40428 8125 7386 6800 ...
## $ dataType : chr "train" "train" "train" "train" ...
## $ log_price : num 5.35 5.81 5.91 5.52 5.72 ...
## $ log_size : num 3.1 3.35 3.6 3.28 3.26 ...
## $ houseview : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ outdoor : int 6250 330 205543 5650 3186 1966 43320 9060 4590 6980 ...
## $ yr_sold : num 2014 2014 2014 2015 2014 ...
## $ house_age : num 47 24 8 48 63 6 19 31 27 55 ...
## $ renovated : num 0 0 0 0 0 0 0 0 0 0 ...
## $ house_cnt : int 141 308 53 233 309 309 145 265 235 309 ...
## $ log_price_avg_adj: num 5.46 5.74 5.74 5.58 5.77 ...
## $ resp_grade : num -0.487 2.977 0.917 -0.201 -0.45 ...
## $ resp_cond : num -0.583 -0.76 -0.503 0.737 0.657 ...
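The per-zipcode standardization in step 7 is an ordinary within-group z-score; on a toy group it matches what base R's scale() computes:

```r
# Toy check: (x - mean(x)) / sd(x) within one zipcode group is the
# same standardization that scale() performs.
g <- c(7, 8, 9, 10)          # grades within one hypothetical zipcode
z <- (g - mean(g)) / sd(g)
stopifnot(all.equal(z, as.numeric(scale(g))))
```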
# Factorizing some variables
columns <- c( "floors", "waterfront", "view", "condition")
dataset[columns] <- lapply(dataset[columns], factor)
# Replacing TRUE/FALSE by 1 and 0
dataset <- dataset %>% mutate(houseview = ifelse(houseview == FALSE, 0, 1))
test_fe <- dataset[dataset$dataType == 'test', ]
train_fe <- dataset[dataset$dataType == 'train', ]
m <- splitdf(train_fe, seed = 20)
train_set <- m$trainset
val_set <- m$testset
train_set <- train_set[, !(names(train_set) %in% c("dataType", "price", "date", "id"))]
val_set <- val_set[, !(names(val_set) %in% c("dataType", "price", "date", "id"))]
names(val_set)
## [1] "bedrooms" "bathrooms" "sqft_living"
## [4] "sqft_lot" "floors" "waterfront"
## [7] "view" "condition" "grade"
## [10] "sqft_above" "sqft_basement" "yr_built"
## [13] "yr_renovated" "zipcode" "lat"
## [16] "long" "sqft_living15" "sqft_lot15"
## [19] "log_price" "log_size" "houseview"
## [22] "outdoor" "yr_sold" "house_age"
## [25] "renovated" "house_cnt" "log_price_avg_adj"
## [28] "resp_grade" "resp_cond"
names(train_set)
## [1] "bedrooms" "bathrooms" "sqft_living"
## [4] "sqft_lot" "floors" "waterfront"
## [7] "view" "condition" "grade"
## [10] "sqft_above" "sqft_basement" "yr_built"
## [13] "yr_renovated" "zipcode" "lat"
## [16] "long" "sqft_living15" "sqft_lot15"
## [19] "log_price" "log_size" "houseview"
## [22] "outdoor" "yr_sold" "house_age"
## [25] "renovated" "house_cnt" "log_price_avg_adj"
## [28] "resp_grade" "resp_cond"
Building the model
str(train_set)
## 'data.frame': 11518 obs. of 29 variables:
## $ bedrooms : int 3 4 3 4 4 2 4 3 3 3 ...
## $ bathrooms : num 2.75 2.5 2.5 3.25 2.75 1 2.75 1.5 1.5 2 ...
## $ sqft_living : int 3010 2200 1040 3420 2260 810 3410 1460 1260 1390 ...
## $ sqft_lot : int 12432 9099 1032 5012 12005 4368 95396 78408 7964 6005 ...
## $ floors : Factor w/ 6 levels "1","1.5","2",..: 1 3 5 3 1 1 2 1 1 3 ...
## $ waterfront : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ view : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ condition : Factor w/ 5 levels "1","2","3","4",..: 4 3 3 3 4 3 4 4 4 3 ...
## $ grade : int 8 8 7 10 8 6 10 7 7 8 ...
## $ sqft_above : int 1890 2200 1040 2330 2260 810 3410 1460 1260 1390 ...
## $ sqft_basement : int 1120 0 0 1090 0 0 0 0 0 0 ...
## $ yr_built : num 1970 1994 2007 2008 1956 ...
## $ yr_renovated : int 0 0 0 0 1989 0 0 0 0 0 ...
## $ zipcode : int 98052 98034 98107 98019 98005 98115 98005 98075 98148 98022 ...
## $ lat : num 47.6 47.7 47.7 47.7 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15 : int 2500 2270 1290 2320 1870 1890 3721 3320 1510 1264 ...
## $ sqft_lot15 : int 12432 8900 1275 5465 10800 5253 35352 7787 8776 5550 ...
## $ log_price : num 5.87 5.77 5.64 5.74 5.87 ...
## $ log_size : num 3.48 3.34 3.02 3.53 3.35 ...
## $ houseview : num 0 0 0 0 0 0 0 0 0 0 ...
## $ outdoor : int 9422 6899 -8 1592 9745 3558 91986 76948 6704 4615 ...
## $ yr_sold : num 2014 2015 2015 2014 2014 ...
## $ house_age : num 44 21 8 6 58 57 52 52 59 9 ...
## $ renovated : num 0 0 0 0 1 0 0 0 0 0 ...
## $ house_cnt : int 313 284 131 90 94 309 94 195 35 119 ...
## $ log_price_avg_adj: num 5.79 5.68 5.74 5.61 5.89 ...
## $ resp_grade : num -0.189 0.504 -0.552 2.629 -0.407 ...
## $ resp_cond : num 1.277 -0.629 -0.609 -0.221 0.451 ...
baseline <- lm(formula = log_price ~ . , data = train_set )
summary(baseline)
##
## Call:
## lm(formula = log_price ~ ., data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55085 -0.04535 0.00238 0.04744 0.50907
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.447e+01 3.708e+00 -17.387 < 2e-16 ***
## bedrooms -6.417e-03 1.062e-03 -6.043 1.56e-09 ***
## bathrooms 1.785e-02 1.832e-03 9.745 < 2e-16 ***
## sqft_living 1.356e-05 3.670e-06 3.694 0.000222 ***
## sqft_lot 2.153e-07 2.722e-08 7.910 2.80e-15 ***
## floors1.5 1.318e-02 3.058e-03 4.312 1.63e-05 ***
## floors2 1.868e-03 2.455e-03 0.761 0.446686
## floors2.5 2.344e-02 8.857e-03 2.646 0.008151 **
## floors3 -1.671e-02 5.502e-03 -3.037 0.002397 **
## floors3.5 1.567e-02 8.135e-02 0.193 0.847275
## waterfront1 1.855e-01 1.058e-02 17.534 < 2e-16 ***
## view1 4.263e-02 6.247e-03 6.824 9.28e-12 ***
## view2 4.480e-02 3.848e-03 11.642 < 2e-16 ***
## view3 6.830e-02 5.207e-03 13.118 < 2e-16 ***
## view4 1.116e-01 7.905e-03 14.113 < 2e-16 ***
## condition2 -1.150e-02 2.445e-02 -0.470 0.638237
## condition3 4.309e-02 2.400e-02 1.795 0.072626 .
## condition4 4.990e-02 2.576e-02 1.937 0.052796 .
## condition5 7.439e-02 2.790e-02 2.666 0.007678 **
## grade -1.599e-02 2.785e-03 -5.742 9.57e-09 ***
## sqft_above 2.781e-05 2.514e-06 11.063 < 2e-16 ***
## sqft_basement NA NA NA NA
## yr_built -6.085e-04 4.518e-05 -13.469 < 2e-16 ***
## yr_renovated 1.526e-03 2.410e-04 6.332 2.51e-10 ***
## zipcode 8.777e-06 1.923e-05 0.456 0.648157
## lat 1.282e-01 8.188e-03 15.664 < 2e-16 ***
## long -1.078e-01 8.028e-03 -13.434 < 2e-16 ***
## sqft_living15 2.449e-05 2.002e-06 12.233 < 2e-16 ***
## sqft_lot15 1.036e-07 4.284e-08 2.418 0.015619 *
## log_size 2.554e-01 1.534e-02 16.651 < 2e-16 ***
## houseview NA NA NA NA
## outdoor NA NA NA NA
## yr_sold 2.277e-02 1.618e-03 14.073 < 2e-16 ***
## house_age NA NA NA NA
## renovated -3.023e+00 4.810e-01 -6.285 3.39e-10 ***
## house_cnt 5.986e-05 1.047e-05 5.716 1.12e-08 ***
## log_price_avg_adj 7.755e-01 1.024e-02 75.762 < 2e-16 ***
## resp_grade 6.228e-02 2.633e-03 23.654 < 2e-16 ***
## resp_cond 4.242e-03 2.752e-03 1.541 0.123298
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08074 on 11483 degrees of freedom
## Multiple R-squared: 0.8756, Adjusted R-squared: 0.8753
## F-statistic: 2378 on 34 and 11483 DF, p-value: < 2.2e-16
The four NA coefficients are not a surprise: sqft_basement, houseview, outdoor and house_age are exact linear combinations of other predictors (e.g. sqft_basement = sqft_living - sqft_above, house_age = yr_sold - yr_built), so lm() drops them as singularities.
predicted_lm <- as.data.frame(predict(baseline , val_set))
## Warning in predict.lm(baseline, val_set): prediction from a rank-deficient
## fit may be misleading
MAPE_lm <- mean(abs((10^(val_set$log_price)-10^(predicted_lm$`predict(baseline, val_set)`))/10^(val_set$log_price)))
RMSE_lm <- sqrt(mean((10^(val_set$log_price)-10^(predicted_lm$`predict(baseline, val_set)`))^2))
print(paste0( 'The MAPE of the Linear Model is : ' , MAPE_lm , " and the RMSE is : " , RMSE_lm))
## [1] "The MAPE of the Linear Model is : 0.144899053375087 and the RMSE is : 139574.135054944"
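Since the same MAPE and RMSE computation is repeated for every model, it could be factored into a small helper (a sketch; eval_model is a hypothetical name not used elsewhere, and the metrics are computed on the back-transformed 10^log_price scale exactly as above):

```r
# Hypothetical helper: back-transform log10 predictions and compute
# MAPE and RMSE on the original price scale.
eval_model <- function(log_actual, log_pred) {
  actual <- 10^log_actual
  pred   <- 10^log_pred
  c(MAPE = mean(abs((actual - pred) / actual)),
    RMSE = sqrt(mean((actual - pred)^2)))
}
# Usage: eval_model(val_set$log_price, predict(baseline, val_set))
```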
Check for the most important variables
imp <- varImp(baseline)
imp
## Overall
## bedrooms 6.0431450
## bathrooms 9.7453620
## sqft_living 3.6937942
## sqft_lot 7.9103114
## floors1.5 4.3118611
## floors2 0.7609785
## floors2.5 2.6461973
## floors3 3.0367595
## floors3.5 0.1926008
## waterfront1 17.5339227
## view1 6.8243793
## view2 11.6421003
## view3 13.1181148
## view4 14.1131877
## condition2 0.4701783
## condition3 1.7953392
## condition4 1.9367845
## condition5 2.6663636
## grade 5.7423478
## sqft_above 11.0628665
## yr_built 13.4691122
## yr_renovated 6.3321317
## zipcode 0.4563364
## lat 15.6635334
## long 13.4340763
## sqft_living15 12.2330211
## sqft_lot15 2.4180500
## log_size 16.6510156
## yr_sold 14.0730860
## renovated 6.2852632
## house_cnt 5.7160489
## log_price_avg_adj 75.7624890
## resp_grade 23.6540469
## resp_cond 1.5411911
Random Forest Model:
### Random Forest model: compared to the linear baseline, both the MAPE and the RMSE are reduced
rf <- randomForest(formula = log_price ~ . , data = train_set)
summary(rf)
## Length Class Mode
## call 3 -none- call
## type 1 -none- character
## predicted 11518 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 11518 -none- numeric
## importance 28 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 11518 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
predicted_rf <- as.data.frame(predict(rf , val_set))
MAPE_rf <- mean(abs((10^(val_set$log_price)-10^(predicted_rf$`predict(rf, val_set)`))/10^(val_set$log_price)))
RMSE_rf <- sqrt(mean((10^(val_set$log_price)-10^(predicted_rf$`predict(rf, val_set)`))^2))
print(paste0( 'The MAPE of the Random Forest Model is : ' , MAPE_rf , " and the RMSE is : " , RMSE_rf))
## [1] "The MAPE of the Random Forest Model is : 0.118297697513655 and the RMSE is : 132394.388623435"
Here XGBoost is tried. The parameters below were tuned over multiple trials, after understanding each one and adapting it to the data at hand.
dtrain <- xgb.DMatrix(data.matrix(train_set[,-which(names(train_set) == "log_price")], rownames.force = NA), label = train_set$log_price)
dtest <- xgb.DMatrix(data.matrix(val_set[,-which(names(val_set) == "log_price")], rownames.force = NA), label = val_set$log_price)
watchlist <- list(train = dtrain, eval = dtest)
param <- list(max_depth = 6, booster = 'gbtree',
              eta = 0.2,
              silent = 1,
              nthread = 2,
              min_child_weight = 4,
              num_parallel_tree = 100,
              objective = "reg:linear",
              eval_metric = "rmse")
xgboost <- xgb.train(param, dtrain, nrounds = 100, watchlist)
## [1] train-rmse:4.136961 eval-rmse:4.141459
## [2] train-rmse:3.310920 eval-rmse:3.314828
## [3] train-rmse:2.650132 eval-rmse:2.653746
## [4] train-rmse:2.121501 eval-rmse:2.124947
## [5] train-rmse:1.698711 eval-rmse:1.701570
## [6] train-rmse:1.360606 eval-rmse:1.363010
## [7] train-rmse:1.090259 eval-rmse:1.092579
## [8] train-rmse:0.874113 eval-rmse:0.876011
## [9] train-rmse:0.701485 eval-rmse:0.703176
## [10] train-rmse:0.563639 eval-rmse:0.565360
## [11] train-rmse:0.453658 eval-rmse:0.455521
## [12] train-rmse:0.366084 eval-rmse:0.368207
## [13] train-rmse:0.296576 eval-rmse:0.299021
## [14] train-rmse:0.241567 eval-rmse:0.244585
## [15] train-rmse:0.198275 eval-rmse:0.201813
## [16] train-rmse:0.164480 eval-rmse:0.168785
## [17] train-rmse:0.138345 eval-rmse:0.143457
## [18] train-rmse:0.118384 eval-rmse:0.124360
## [19] train-rmse:0.103550 eval-rmse:0.110331
## [20] train-rmse:0.092256 eval-rmse:0.099899
## [21] train-rmse:0.084403 eval-rmse:0.092805
## [22] train-rmse:0.078030 eval-rmse:0.087198
## [23] train-rmse:0.074071 eval-rmse:0.083782
## [24] train-rmse:0.070971 eval-rmse:0.081310
## [25] train-rmse:0.068614 eval-rmse:0.079446
## [26] train-rmse:0.067092 eval-rmse:0.078301
## [27] train-rmse:0.065784 eval-rmse:0.077405
## [28] train-rmse:0.064433 eval-rmse:0.076326
## [29] train-rmse:0.063803 eval-rmse:0.075898
## [30] train-rmse:0.063133 eval-rmse:0.075505
## [31] train-rmse:0.062620 eval-rmse:0.075236
## [32] train-rmse:0.061819 eval-rmse:0.074651
## [33] train-rmse:0.061481 eval-rmse:0.074538
## [34] train-rmse:0.060959 eval-rmse:0.074273
## [35] train-rmse:0.060734 eval-rmse:0.074169
## [36] train-rmse:0.059760 eval-rmse:0.073415
## [37] train-rmse:0.059366 eval-rmse:0.073315
## [38] train-rmse:0.059155 eval-rmse:0.073253
## [39] train-rmse:0.058887 eval-rmse:0.073115
## [40] train-rmse:0.058173 eval-rmse:0.072784
## [41] train-rmse:0.057851 eval-rmse:0.072618
## [42] train-rmse:0.057615 eval-rmse:0.072556
## [43] train-rmse:0.057398 eval-rmse:0.072474
## [44] train-rmse:0.057129 eval-rmse:0.072442
## [45] train-rmse:0.057014 eval-rmse:0.072442
## [46] train-rmse:0.056731 eval-rmse:0.072303
## [47] train-rmse:0.056322 eval-rmse:0.072050
## [48] train-rmse:0.055873 eval-rmse:0.071854
## [49] train-rmse:0.055477 eval-rmse:0.071728
## [50] train-rmse:0.055190 eval-rmse:0.071646
## [51] train-rmse:0.055054 eval-rmse:0.071584
## [52] train-rmse:0.054635 eval-rmse:0.071274
## [53] train-rmse:0.054348 eval-rmse:0.071181
## [54] train-rmse:0.054137 eval-rmse:0.071179
## [55] train-rmse:0.053497 eval-rmse:0.070786
## [56] train-rmse:0.053112 eval-rmse:0.070724
## [57] train-rmse:0.052962 eval-rmse:0.070671
## [58] train-rmse:0.052699 eval-rmse:0.070582
## [59] train-rmse:0.052418 eval-rmse:0.070344
## [60] train-rmse:0.052259 eval-rmse:0.070340
## [61] train-rmse:0.051795 eval-rmse:0.070268
## [62] train-rmse:0.051477 eval-rmse:0.070119
## [63] train-rmse:0.051350 eval-rmse:0.070167
## [64] train-rmse:0.051173 eval-rmse:0.070115
## [65] train-rmse:0.051117 eval-rmse:0.070117
## [66] train-rmse:0.050882 eval-rmse:0.070076
## [67] train-rmse:0.050717 eval-rmse:0.069973
## [68] train-rmse:0.050455 eval-rmse:0.069940
## [69] train-rmse:0.050274 eval-rmse:0.069911
## [70] train-rmse:0.050067 eval-rmse:0.069759
## [71] train-rmse:0.049741 eval-rmse:0.069618
## [72] train-rmse:0.049315 eval-rmse:0.069365
## [73] train-rmse:0.049088 eval-rmse:0.069286
## [74] train-rmse:0.048923 eval-rmse:0.069260
## [75] train-rmse:0.048852 eval-rmse:0.069260
## [76] train-rmse:0.048803 eval-rmse:0.069239
## [77] train-rmse:0.048383 eval-rmse:0.069057
## [78] train-rmse:0.048075 eval-rmse:0.068931
## [79] train-rmse:0.047865 eval-rmse:0.068855
## [80] train-rmse:0.047709 eval-rmse:0.068757
## [81] train-rmse:0.047465 eval-rmse:0.068681
## [82] train-rmse:0.047198 eval-rmse:0.068633
## [83] train-rmse:0.046803 eval-rmse:0.068521
## [84] train-rmse:0.046495 eval-rmse:0.068458
## [85] train-rmse:0.046212 eval-rmse:0.068265
## [86] train-rmse:0.045983 eval-rmse:0.068234
## [87] train-rmse:0.045758 eval-rmse:0.068179
## [88] train-rmse:0.045549 eval-rmse:0.068044
## [89] train-rmse:0.045388 eval-rmse:0.068058
## [90] train-rmse:0.045154 eval-rmse:0.067950
## [91] train-rmse:0.045007 eval-rmse:0.067907
## [92] train-rmse:0.044663 eval-rmse:0.067740
## [93] train-rmse:0.044417 eval-rmse:0.067589
## [94] train-rmse:0.044276 eval-rmse:0.067532
## [95] train-rmse:0.044070 eval-rmse:0.067444
## [96] train-rmse:0.043911 eval-rmse:0.067413
## [97] train-rmse:0.043784 eval-rmse:0.067390
## [98] train-rmse:0.043656 eval-rmse:0.067387
## [99] train-rmse:0.043550 eval-rmse:0.067287
## [100] train-rmse:0.043468 eval-rmse:0.067261
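The run above fixes `nrounds = 100`; an alternative is to let XGBoost stop when the watchlist metric stalls. The sketch below is an assumption, not the original tuning procedure: it uses toy data in place of `dtrain`/`dtest`, the newer `reg:squarederror` objective, and `early_stopping_rounds` to pick the iteration count automatically.

```r
library(xgboost)

# Toy regression data standing in for the dtrain/dtest matrices built above.
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
y <- x[, 1] * 3 + rnorm(100, sd = 0.1)
dtr <- xgb.DMatrix(x[1:80, ],   label = y[1:80])
dva <- xgb.DMatrix(x[81:100, ], label = y[81:100])

# Stop once eval RMSE fails to improve for 10 consecutive rounds.
fit <- xgb.train(
  params = list(objective = "reg:squarederror", eta = 0.2, max_depth = 3),
  data = dtr, nrounds = 500,
  watchlist = list(train = dtr, eval = dva),
  early_stopping_rounds = 10, verbose = 0
)
fit$best_iteration  # the round with the best eval RMSE
```

This avoids hand-picking `nrounds` and guards against training past the point where the validation error bottoms out.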
# summary(xgboost)
predicted_xgboost <- predict(xgboost, dtest)
MAPE_xg <- mean(abs((10^(val_set$log_price) - 10^predicted_xgboost) / 10^(val_set$log_price)))
RMSE_xg <- sqrt(mean((10^(val_set$log_price) - 10^predicted_xgboost)^2))
print(paste0('The MAPE of the XGBoost Model is : ', MAPE_xg, " and the RMSE is : ", RMSE_xg))
## [1] "The MAPE of the XGBoost Model is : 0.110969885696949 and the RMSE is : 118411.115362964"
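Placing the validation metrics printed above side by side makes the model choice explicit. A minimal sketch, using the rounded MAPE/RMSE values reported earlier in this section:

```r
# Validation metrics reported above, collected for comparison.
results <- data.frame(
  model = c("Random Forest", "XGBoost"),
  MAPE  = c(0.1183, 0.1110),
  RMSE  = c(132394.4, 118411.1)
)

# Pick the model with the lowest validation RMSE.
best <- results$model[which.min(results$RMSE)]
# best is "XGBoost"
```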
Since XGBoost gave the best validation score, it is the model that will be used to generate predictions on the final test set.
d_f_test<- xgb.DMatrix(data.matrix(test_fe[,!(names(test_fe) %in% c("dataType" , 'price' , 'date' , 'id' ,'log_price'))], rownames.force = NA), label = test_fe$log_price)
test_fe$price_pred <- 10^predict(xgboost , d_f_test)
id_pred <- test_fe[,c('id', 'price_pred')]
write.csv(id_pred, 'test_prediction.csv', row.names = FALSE)  # drop the row-name column from the output file
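As a quick sanity check on the submission format (a sketch, not part of the original pipeline), the written file can be read back to confirm that the `id` and `price_pred` columns survive the round trip; a temporary file and made-up values are used here so the real `test_prediction.csv` is untouched.

```r
# Toy prediction frame mirroring the structure of id_pred above.
id_pred_demo <- data.frame(id = c(1L, 2L), price_pred = c(450000, 645500))

# Round-trip through a temporary CSV file.
tmp <- tempfile(fileext = ".csv")
write.csv(id_pred_demo, tmp, row.names = FALSE)
back <- read.csv(tmp)
names(back)  # should be "id" and "price_pred"
```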